This report explores collision reports in Seattle, Washington, USA.
The Seattle Collisions dataset is a compilation of over 200,000 collision reports created by the Seattle Police Department (SPD) and then recorded by the Seattle Department of Transportation (SDOT) between 2004 and 2018.
Rows: 205380
Variables: 28
'data.frame': 205380 obs. of 28 variables:
$ ADDRTYPE : Factor w/ 4 levels "","Alley","Block",..: 3 3 3 4 4 3 3 3 4 4 ...
$ SEVERITYCODE : Factor w/ 6 levels "","0","1","2",..: 3 3 3 4 4 3 4 3 4 4 ...
$ SEVERITYDESC : Factor w/ 5 levels "Fatality","Injury",..: 3 3 3 2 2 3 2 3 2 2 ...
$ COLLISIONTYPE : Factor w/ 11 levels "","Angles","Cycles",..: 5 11 6 2 3 7 9 11 5 9 ...
$ PERSONCOUNT : int 2 3 2 4 5 3 2 2 2 2 ...
$ PEDCOUNT : int 0 0 0 0 0 0 0 0 0 0 ...
$ PEDCYLCOUNT : int 0 0 0 0 1 0 0 0 0 0 ...
$ VEHCOUNT : int 2 2 2 2 1 2 2 2 2 2 ...
$ INJURIES : int 0 0 0 1 1 0 2 0 2 1 ...
$ SERIOUSINJURIES: int 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES : int 0 0 0 0 0 0 0 0 0 0 ...
$ INCDATE : Date, format: "2013-04-02" "2013-03-30" ...
$ INCDTTM : Factor w/ 155661 levels "1/1/04","1/1/05",..: 81567 74101 20779 125708 81577 57671 153722 114998 81578 138038 ...
$ JUNCTIONTYPE : Factor w/ 8 levels "","At Intersection (but not related to intersection)",..: 6 6 6 3 3 6 5 5 3 3 ...
$ SDOT_COLCODE : int 11 11 11 11 51 11 14 11 11 14 ...
$ INATTENTION : Factor w/ 2 levels "","Y": 1 1 1 1 2 1 2 2 1 2 ...
$ DUI : Factor w/ 3 levels "","N","Y": 2 2 2 2 2 2 2 2 2 2 ...
$ WEATHER : Factor w/ 10 levels "Blowing Sand or Dirt or Snow",..: 5 2 6 2 5 2 2 6 2 2 ...
$ ROADCOND : Factor w/ 10 levels "","Dry","Ice",..: 2 2 10 2 2 2 2 10 2 2 ...
$ LIGHTCOND : Factor w/ 9 levels "","Dark - No Street Lights",..: 6 6 6 6 7 4 6 6 7 6 ...
$ SPEEDING : Factor w/ 2 levels "","Y": 1 1 2 1 1 1 1 1 1 1 ...
$ Year : Factor w/ 15 levels "2004","2005",..: 10 10 3 2 10 3 1 2 10 2 ...
$ Month : Factor w/ 12 levels "1","2","3","4",..: 4 3 10 7 4 2 9 6 4 8 ...
$ Time : POSIXct, format: "2019-02-26 15:10:00" "2019-02-26 14:00:00" ...
$ Hour : int 15 14 10 11 19 0 12 NA 19 7 ...
$ DayPart : Factor w/ 4 levels "Afternoon","Evening",..: 1 1 3 3 2 4 1 NA 2 3 ...
$ SDOTtype : Factor w/ 7 levels "Head On","Hits Pedestrian",..: 6 6 6 6 7 6 5 6 6 5 ...
$ Season : Factor w/ 4 levels "Fall","Spring",..: 2 2 1 3 2 4 1 3 2 3 ...
ADDRTYPE SEVERITYCODE SEVERITYDESC
: 3608 : 0 Fatality : 315
Alley : 809 0 : 19559 Injury : 54377
Block :134732 1 :128276 Property Damage Only:128276
Intersection: 66231 2 : 54377 Serious Injury : 2853
2b: 2853 Unknown : 19559
3 : 315
COLLISIONTYPE PERSONCOUNT PEDCOUNT PEDCYLCOUNT
Parked Car:45855 Min. : 0.00 Min. :0.00000 Min. :0.00000
Angles :32752 1st Qu.: 2.00 1st Qu.:0.00000 1st Qu.:0.00000
Rear Ended:32404 Median : 2.00 Median :0.00000 Median :0.00000
:23778 Mean : 2.23 Mean :0.03707 Mean :0.02691
Other :22952 3rd Qu.: 3.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Sideswipe :17385 Max. :93.00 Max. :6.00000 Max. :2.00000
(Other) :30254
VEHCOUNT INJURIES SERIOUSINJURIES FATALITIES
Min. : 0.000 Min. : 0.0000 Min. : 0.00000 Min. :0.000000
1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.:0.000000
Median : 2.000 Median : 0.0000 Median : 0.00000 Median :0.000000
Mean : 1.738 Mean : 0.3741 Mean : 0.01511 Mean :0.001641
3rd Qu.: 2.000 3rd Qu.: 1.0000 3rd Qu.: 0.00000 3rd Qu.:0.000000
Max. :15.000 Max. :78.0000 Max. :41.00000 Max. :5.000000
INCDATE INCDTTM
Min. :2004-01-01 11/2/06: 103
1st Qu.:2007-04-12 10/8/04: 98
Median :2011-01-29 10/3/08: 92
Mean :2011-03-16 11/5/05: 85
3rd Qu.:2015-02-03 1/2/04 : 80
Max. :2018-12-31 8/6/04 : 79
(Other):204843
JUNCTIONTYPE SDOT_COLCODE
Mid-Block (not related to intersection) :93480 Min. : 0.00
At Intersection (intersection related) :63603 1st Qu.:11.00
Mid-Block (but intersection related) :23747 Median :11.00
Driveway Junction :10970 Mean :13.38
:10954 3rd Qu.:14.00
At Intersection (but not related to intersection): 2419 Max. :69.00
(Other) : 207
INATTENTION DUI WEATHER
:177419 : 23758 Clear or Partly Cloudy:106169
Y: 27961 N:172648 Unknown : 38763
Y: 8974 Raining : 31724
Overcast : 26396
Snowing : 818
Other : 782
(Other) : 728
ROADCOND LIGHTCOND SPEEDING
Dry :119152 Daylight :110852 :196105
Wet : 45175 Dark - Street Lights On: 46431 Y: 9275
: 23875 : 24021
Unknown : 14750 Unknown : 13203
Ice : 1140 Dusk : 5647
Snow/Slush: 923 Dawn : 2385
(Other) : 365 (Other) : 2841
Year Month Time
2005 : 16016 10 :18967 Min. :2019-02-26 00:01:00
2006 : 15794 11 :17645 1st Qu.:2019-02-26 09:53:00
2004 : 15457 5 :17614 Median :2019-02-26 14:23:00
2007 : 15082 8 :17550 Mean :2019-02-26 13:41:57
2015 : 14260 6 :17498 3rd Qu.:2019-02-26 17:48:00
2008 : 14139 7 :17448 Max. :2019-02-26 23:59:00
(Other):114632 (Other):98658 NA's :49797
Hour DayPart SDOTtype Season
Min. : 0.00 Afternoon:52009 Head On : 7062 Fall :53586
1st Qu.: 9.00 Evening :28499 Hits Pedestrian:17846 Spring:50957
Median :14.00 Morning :39398 Non-Collision : 393 Summer:52496
Mean :13.25 Night :35677 Other Collision:30343 Winter:48341
3rd Qu.:17.00 NA's :49797 Rear End :61677
Max. :23.00 Sideswipe :86569
NA's :49797 Struck Object : 1490
The dataset consists of 28 variables for 205,380 observations.
It will be interesting to see what factors may contribute to Seattle collisions, and if they change over time.
The distribution of collision frequency by year appears bimodal, peaking at 2005 and, to a lesser extent, 2015, with a low at 2010 and again in 2018. This plot hints at a possible 5-year pattern.
I had expected to see a steady increase in Seattle collisions, mirroring population growth. Perhaps I’m wrong about the steady increase in population, or about the number of reported collisions being strongly correlated with the population in the first place.
It appears that collision counts tend to rise during commute hours, starting at the lowest point around 4am.
Our dataset includes 2 Collision type (category) variables:
COLLISIONTYPE reported by SPD (Seattle Police Department)
SDOTtype recorded by SDOT (Seattle Department of Transportation)
SPD has 11 collision categories while SDOT has 7. Which type I choose to include in this study could have an impact on the analysis. For example, if we are analyzing collisions involving pedestrians, SPD reports only 1,770 ‘Pedestrian’ collisions, while SDOT reports ‘Hits Pedestrian’ 17,846 times. That’s roughly a 10x difference. Also, SDOT’s ‘Rear End’ count is almost twice SPD’s.
It is possible that SPD has reason to categorize collisions differently than SDOT. The reasons are outside the scope of this project. Nonetheless, it would be interesting to look into how collision types are recorded, and whether it would make a difference in our analysis.
In this report, except where otherwise noted, I will be using SDOT’s collision categories.
Observations:
This plot orders the Severity bins in order of worsening severity.
The frequency of collisions by Severity is right-skewed, with the tail extending toward higher severity.
Over 60% of collisions result in Property Damage Only (no injuries or deaths).
The Severity factor alone does not quantify the damage done by collisions. We could make a rough attempt to quantify damage by combining the VEHCOUNT, INJURIES, SERIOUSINJURIES and FATALITIES columns in a calculation, but it would be a crude quantification.
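As a rough illustration of the crude quantification mentioned above, the sketch below combines the four count columns into one weighted score. The weights and the column name `damage_score` are illustrative assumptions, not values defined by SPD or SDOT, and the rows are toy data.

```r
# Toy rows standing in for the collision data (values are made up).
collisions <- data.frame(
  VEHCOUNT        = c(2, 1, 3),
  INJURIES        = c(0, 1, 2),
  SERIOUSINJURIES = c(0, 0, 1),
  FATALITIES      = c(0, 0, 1)
)

# Hypothetical weights: each step up in harm counts ten times more.
weights <- c(VEHCOUNT = 1, INJURIES = 10, SERIOUSINJURIES = 100, FATALITIES = 1000)

# Weighted sum per row: matrix product of the count columns and the weights.
collisions$damage_score <- as.vector(as.matrix(collisions[names(weights)]) %*% weights)
```

Any such score depends heavily on the chosen weights, which is why it can only be a crude measure.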
Let’s get a rough idea of how counts are distributed amongst these types of categorical factors. Don’t worry about not being able to read the x-axis labels yet :-> .
Observations:
Location (AddrType and JunctionType) each have two predominant types. We might dig into these two variables later, to see if JunctionType is a subset of AddrType.
Environmental conditions (Weather, LightCond, RoadCond) each have a dominant category. Earlier we saw that most collisions happen in the middle of the day, so I suspect that is represented here (Clear, Daylight, Dry).
Time factor (Season, DayPart) frequencies are more evenly spread, with the highest frequencies in Fall (Season) and Afternoon (DayPart).
These are the quantitative variables, which are simply counts of things. Let’s see what they are and how they’re spread out.
VEHCOUNT PERSONCOUNT PEDCOUNT PEDCYLCOUNT
Min. : 0.000 Min. : 0.00 Min. :0.00000 Min. :0.00000
1st Qu.: 2.000 1st Qu.: 2.00 1st Qu.:0.00000 1st Qu.:0.00000
Median : 2.000 Median : 2.00 Median :0.00000 Median :0.00000
Mean : 1.738 Mean : 2.23 Mean :0.03707 Mean :0.02691
3rd Qu.: 2.000 3rd Qu.: 3.00 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :15.000 Max. :93.00 Max. :6.00000 Max. :2.00000
INJURIES SERIOUSINJURIES FATALITIES
Min. : 0.0000 Min. : 0.00000 Min. :0.000000
1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.:0.000000
Median : 0.0000 Median : 0.00000 Median :0.000000
Mean : 0.3741 Mean : 0.01511 Mean :0.001641
3rd Qu.: 1.0000 3rd Qu.: 0.00000 3rd Qu.:0.000000
Max. :78.0000 Max. :41.00000 Max. :5.000000
The Counts of Things frequencies are all right-skewed. Only the VEHCOUNT and PERSONCOUNT peak at 2 instead of 0.
“The number of vehicles involved in the collision. This is entered by the state.”
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 2.000 1.738 2.000 15.000
The vast majority of collisions involve 2 vehicles. The median, Q1 and Q3 are all 2, and the mean is close to it (1.74), as verified with the flat box plot.
“The total number of people involved in the collision”
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.00 2.00 2.23 3.00 93.00
Observations:
Most collisions involve 2 people, the next highest count being 3 persons.
Some reportedly have zero PERSONCOUNT.
Median and Q1 are the same (2), with Q3 being just one away, then a very long tail to a maximum of 93.
There is no data available to explain the circumstances of the larger PERSONCOUNTs.
Observations:
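INATTENTION was reported in approximately 14% of collisions (27,961 of 205,380). Positive DUI was reported in about 4% (8,974), or about 5% of the reports that have a DUI value.
Surprisingly, SPEEDING is infrequently noted in collision reports (9,275 reports, about 4.5%).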
Note: INATTENTION and SPEEDING have only either a Y(es) or NULL value, suggesting that the lack of value (NULL) means these are not contributing factors.
However, fewer DUI values are NULL, as most are marked with either (Y)es or (N)o. It is not apparent whether the absence of a value is an oversight, whether the presence of DUI was inconclusive, or whether there is some other meaning. For the rest of this report, unless otherwise stated, NULL values for DUI are not included in our analysis when considering the DUI variable.
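The two treatments of missing values described above can be sketched as follows. This is a minimal example on made-up rows, assuming (as in the factor levels shown earlier) that a missing value is stored as an empty string.

```r
# Toy data: "" stands for a missing (NULL) value.
collisions <- data.frame(
  DUI         = c("N", "N", "Y", "", "N"),
  INATTENTION = c("", "Y", "", "", ""),
  stringsAsFactors = FALSE
)

# INATTENTION: treat "" as "not a factor", so the rate uses every row.
inattention_rate <- mean(collisions$INATTENTION == "Y")

# DUI: drop rows with no recorded value before computing the rate.
dui_known <- collisions$DUI[collisions$DUI != ""]
dui_rate  <- mean(dui_known == "Y")
```

The choice of denominator matters: the same DUI count gives a lower rate over all rows than over only the rows with a recorded value.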
The cleaned dataset contains 205,380 collisions reported by SPD between the years 2004 and 2018.
Most of the variables are categorical, describing:
Boolean-type variables (Yes, No, or NULL) include:
Quantitative variables include counts of:
Most collisions:
The main features of interest for me are:
Counts of Things may help explain possible SPD and SDOT collision classification disparities.
Counts of Things may also help explain Severity classifications.
Also, geographical and environmental variables may help support the investigation of main features.
Yes. I created point-in-time variables to help make reporting and graphing easier:
I created a new collision categorical variable (SDOTtype) to group SDOT’s related collision types. For example, I combined 6 types of ‘Sideswipe’ collisions into one category called ‘Sideswipe’.
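The derived variables described above could be built along these lines. This is only a sketch: the DayPart hour boundaries, the season-to-month mapping, and the SDOT description strings are assumptions chosen for illustration, and the timestamps are made up.

```r
# Toy incident timestamps (made up); the real INCDTTM format may differ.
incdttm <- as.POSIXct(c("2013-04-02 15:10:00", "2006-10-08 07:30:00",
                        "2010-01-15 23:45:00"), tz = "UTC")

year  <- as.integer(format(incdttm, "%Y"))
month <- as.integer(format(incdttm, "%m"))
hour  <- as.integer(format(incdttm, "%H"))

# Assumed DayPart boundaries: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23.
day_part <- cut(hour, breaks = c(-1, 5, 11, 17, 23),
                labels = c("Night", "Morning", "Afternoon", "Evening"))

# Assumed meteorological seasons: Dec-Feb Winter, Mar-May Spring, and so on.
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
season <- season_by_month[month]

# Collapse every description containing "sideswipe" into one category
# (these descriptions are invented examples, not actual SDOT codes).
sdot_desc <- c("Sideswipe - same direction", "Rear end", "Sideswipe - opposite direction")
sdot_type <- ifelse(grepl("sideswipe", sdot_desc, ignore.case = TRUE), "Sideswipe", sdot_desc)
```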
I log-transformed the strongly right-skewed Counts of Things distributions (vehicles, people, injuries, etc). Both the vehicle and person counts peaked at 2 while all other counts peaked at 0. Person count, injuries and serious injuries have a long right tail. The log10 transformation made it easier to see counts other than the left-side peaks.
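One detail worth illustrating is how zero counts behave under a log transform. The report does not say how zeros were handled, so the `+ 1` shift below is an assumption, shown on made-up counts.

```r
# Toy injury counts (made up), mostly zeros with a long right tail.
injuries <- c(0, 0, 0, 1, 1, 2, 5, 78)

# log10(0) is -Inf, so zero counts would vanish from a log-scaled axis.
raw_log <- log10(injuries)

# Adding 1 before the log keeps zeros on the axis at log10(1) = 0.
shifted_log <- log10(injuries + 1)
```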
Other changes:
Original values of DUI included: 0, 1, Y, N and null. In order to standardize the values, I changed 0 to N and 1 to Y.
I deleted rows with Incident Dates before 2004 or after 2018. Original records outside that range contained only partial-year records.
There was a single record with a null value for SEVERITYCODE. I first verified there were no interesting anomalies in that record (like a large Fatality count), then deleted it. Otherwise, every graph would have included an additional tick on an axis, or an additional legend, just for that one record.
Original WEATHER values included both ‘Unknown’ and ‘’ (empty space). I standardized by combining both as ’Unknown’.
Original SEVERITYDESC string values were postfixed with ‘Collision’, which made labels unnecessarily long, so I removed those postfixes.
I deleted columns from the dataset that were not of interest for this study (such as report numbers and administrative codes.)
I renamed two columns for easier programming and plot reading:
INATTENTION <- INATTENTIONID (“Whether or not collision was due to inattention. (Y/N)”)
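Several of the cleaning steps listed above can be sketched in base R. The rows are made up, and the code is one possible way to do it rather than the exact code used for this report.

```r
# Toy rows (made up) with the kinds of values described above.
raw <- data.frame(
  INCDATE      = as.Date(c("2003-12-30", "2010-05-01", "2018-12-31", "2019-01-02")),
  DUI          = c("0", "1", "N", "Y"),
  WEATHER      = c("Raining", "", "Unknown", "Overcast"),
  SEVERITYDESC = c("Injury Collision", "Property Damage Only Collision",
                   "Serious Injury Collision", "Fatality Collision"),
  stringsAsFactors = FALSE
)

# Standardize DUI: 0 -> N, 1 -> Y; existing Y/N values pass through unchanged.
raw$DUI[raw$DUI == "0"] <- "N"
raw$DUI[raw$DUI == "1"] <- "Y"

# Fold empty WEATHER strings into the existing 'Unknown' value.
raw$WEATHER[raw$WEATHER == ""] <- "Unknown"

# Strip the trailing ' Collision' from severity descriptions.
raw$SEVERITYDESC <- sub(" Collision$", "", raw$SEVERITYDESC)

# Keep only incidents dated 2004 through 2018.
clean <- raw[raw$INCDATE >= as.Date("2004-01-01") &
             raw$INCDATE <= as.Date("2018-12-31"), ]
```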
From this correlation chart, we see moderately strong relationships between:
DUI and …
Vehicle Count and …
Let’s view these stronger correlations in a couple different ways …
DUI VEHCOUNT SEVERITYCODE COLLISIONTYPE WEATHER
DUI 1.00 0.65 0.51 0.48 -0.57
VEHCOUNT 0.65 1.00 0.38 0.48 -0.49
SEVERITYCODE 0.51 0.38 1.00 0.24 -0.43
COLLISIONTYPE 0.48 0.48 0.24 1.00 -0.34
WEATHER -0.57 -0.49 -0.43 -0.34 1.00
PERSONCOUNT 0.38 0.55 0.37 0.25 -0.32
ROADCOND 0.27 0.23 0.14 0.18 0.32
LIGHTCOND 0.58 0.60 0.39 0.45 -0.34
PERSONCOUNT ROADCOND LIGHTCOND
DUI 0.38 0.27 0.58
VEHCOUNT 0.55 0.23 0.60
SEVERITYCODE 0.37 0.14 0.39
COLLISIONTYPE 0.25 0.18 0.45
WEATHER -0.32 0.32 -0.34
PERSONCOUNT 1.00 0.11 0.30
ROADCOND 0.11 1.00 0.29
LIGHTCOND 0.30 0.29 1.00
DUI and VEHCOUNT have the strongest relationships with the other variables, and with each other (0.65).
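The report does not show how the correlation table was computed. One common approach, assumed here, is to replace each factor with its integer level code and call `cor()`. The sketch below uses made-up rows and also shows why such values need care: for a nominal factor, the sign depends on the arbitrary order of its levels.

```r
# Toy data (made up) mixing a count with two categorical variables.
collisions <- data.frame(
  VEHCOUNT  = c(2, 2, 1, 3, 2, 1),
  DUI       = factor(c("N", "N", "Y", "N", "Y", "Y")),
  LIGHTCOND = factor(c("Daylight", "Daylight", "Dark", "Daylight", "Dark", "Dark"))
)

# Replace each factor with its integer level code; numeric columns are unchanged.
as_codes <- data.frame(lapply(collisions, function(col) {
  if (is.factor(col)) as.integer(col) else col
}))

# Pairwise Pearson correlations, rounded to two decimals.
cor_matrix <- round(cor(as_codes), 2)

# Reordering the levels of a nominal factor flips the sign of its correlations.
flipped <- as.integer(factor(collisions$LIGHTCOND, levels = c("Daylight", "Dark")))
flipped_cor <- cor(as_codes$DUI, flipped)
```

Under this assumption, a negative value such as the WEATHER column's may reflect level ordering rather than a meaningful direction, so the magnitudes are best read as a rough screening tool.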
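library(dplyr)
library(scales)  # for percent()

# Count each col1/col2 pair and compute its share within col1.
create_summary <- function(data, col1, col2) {
  col1 <- enquo(col1)
  col2 <- enquo(col2)
  result_summary <- data %>%
    group_by(!! col1, !! col2) %>%
    summarise(count = n()) %>%
    # summarise() drops the last grouping level, so sum(count) is per col1 group
    mutate(perc = count / sum(count)) %>%
    mutate(label = percent(perc %>% round(5)))
  result_summary
}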
COLLISIONTYPE and VEHCOUNT have a moderately-strong relationship (.5). Let’s see what they look like.
It appears that ‘Rear End’ collisions involve noticeably more vehicles in single incidents. This chart shows percentages, not counts, so we do not know from this plot just how many collisions involve these multiple vehicles.
Interestingly, the lowest recorded DUI rate was in the same year as our second peak in total collisions (2015). What happened in 2015 to cause such a drop?
However, when we plot the rate of DUI in perspective with all collisions, the DUI rate changes lose their punch and appear relatively steady.
Although INATTENTION was not strongly correlated with other factors, we did see in the Univariate section that it is flagged more often than DUI or SPEEDING, so let’s see how the INATTENTION rate changed over the years.
There was a significant increase in INATTENTION being a factor in collisions, starting in the year 2013. Perhaps this is a result of the increase in the use of mobile devices, or of directives for the SPD to record ‘inattention’ as a factor in collisions. Any further analysis would require more data and is outside the scope of this study.
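A yearly rate like the one plotted here can be computed as the share of flagged reports within each year. The sketch below uses made-up rows and base R's `tapply()`; it is an illustration, not the code behind the plot.

```r
# Toy data (made up): one entry per collision with its year and INATTENTION flag.
year        <- c(2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013)
inattention <- c("",   "",   "",   "Y",  "Y",  "",   "Y",  "")

# Share of collisions flagged 'Y' within each year.
rate_by_year <- tapply(inattention == "Y", year, mean)
```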
In the Univariate section, we saw a possible disparity between SPD’s reporting of collision factors and SDOT’s follow-up recording of these factors. Let’s look into this a little more.
It’s interesting that collisions SPD reports as ‘Sideswipe’ sometimes end up categorized by SDOT as ‘Other Collision’ or ‘Rear End’. Also, it appears that SPD ‘Pedestrian’ labels frequently show up in SDOT records as ‘Head On’.
View the same information as the plot above, but show the percentages of COLLISIONTYPEs within each SDOTtype, instead of counts:
Plotting by percentage makes it easier to see the spread of COLLISIONTYPEs. For example, we can now easily see that SPD ‘Sideswipe’ (yellow) are also found in SDOT ‘Hits Pedestrian’ records.
Also, we can now see that SPD-reported collisions involving Cycles are spread out over several SDOT categories.
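The move from counts to within-group percentages can be sketched in base R with `table()` and `prop.table()`, as an alternative to the dplyr-based `create_summary` helper. The paired labels below are made up for illustration.

```r
# Toy paired labels (made up): SPD's COLLISIONTYPE and SDOT's grouped type.
spd  <- c("Sideswipe", "Sideswipe", "Rear Ended", "Pedestrian", "Sideswipe", "Rear Ended")
sdot <- c("Sideswipe", "Rear End", "Rear End", "Head On", "Sideswipe", "Rear End")

# Counts: rows are SDOT types, columns are SPD types.
counts <- table(SDOTtype = sdot, COLLISIONTYPE = spd)

# margin = 1 divides each row by its row total, giving the share of each
# SPD label within one SDOT type.
shares <- prop.table(counts, margin = 1)
```

Because every row sums to 1, small SDOT categories become as readable as large ones, at the cost of hiding how many collisions each row represents.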
It is possible that SDOT reclassifies Collision Types after all the dust settles and more facts are available. It would be interesting to dig a little deeper into reclassifications.
It appears that the reporting of JUNCTIONTYPE and ADDRTYPE classifications is fairly in sync, and that JUNCTIONTYPE is used as a sub-category of ADDRTYPE.
Surprisingly, it appears the % of collisions with ‘Hits Pedestrian’ rose significantly around the years where there were fewer total collisions (2010-2011). Are there more pedestrians on the road, and fewer drivers? Maybe more inattentive pedestrians getting hit? That would be an interesting further study, but outside the scope of this project since INATTENTION does not specify who was inattentive.
It will be interesting to look more into the ‘Hits Pedestrian’ increase.
There are several variables that indicate a Pedestrian was involved in a collision (COLLISIONTYPE, SDOTtype, PEDCOUNT). Are all pedestrian-related collisions being marked by SDOT as ‘Hits Pedestrian’?
Let’s bundle all records with any Pedestrian indicator into ‘Hits Pedestrian’. Will the increasing trend of ‘Hits Pedestrian’ records stay the same, or even out across the years?
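The bundling step can be sketched as a simple "any indicator" rule. The rows are made up and the column name `SDOTtype_adj` is a hypothetical name for the adjusted category.

```r
# Toy rows (made up) with the three pedestrian indicators named above.
collisions <- data.frame(
  COLLISIONTYPE = c("Pedestrian", "Rear Ended", "Other", "Sideswipe"),
  SDOTtype      = c("Head On", "Rear End", "Hits Pedestrian", "Sideswipe"),
  PEDCOUNT      = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# A record counts as pedestrian-related if any one indicator says so.
has_ped <- collisions$COLLISIONTYPE == "Pedestrian" |
  collisions$SDOTtype == "Hits Pedestrian" |
  collisions$PEDCOUNT > 0

# Adjusted category: relabel flagged rows, leave the rest as SDOT recorded them.
collisions$SDOTtype_adj <- ifelse(has_ped, "Hits Pedestrian", collisions$SDOTtype)
```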
Here we see approximately 25% of the records with some indication of Pedestrian involvement were originally classified as ‘Head On’. This suggests that most pedestrians are hit with the front of a car, as opposed to sideswiped or rear-ended.
Comparing the original and the Adjusted plots, it appears that the ratios pretty much stay the same, except that some of the ‘Head On’ records got moved to ‘Hits Pedestrian’.
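Theory: Whatever causes the SPD and SDOT discrepancies in ‘Pedestrian’ reporting appears to be consistent, so the increased rate of ‘Hits Pedestrian’ appears to be real rather than an artifact of classification. (In hypothesis-testing terms, the plots point toward rejecting the null hypothesis that the rate of pedestrians getting hit did not change, though that has not been formally tested here.)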
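A theory like this could be checked more formally with a two-sample test of proportions. The sketch below uses `prop.test()` from base R's stats package on made-up counts for two hypothetical years; none of the numbers come from the dataset.

```r
# Toy counts (made up, NOT taken from the dataset): pedestrian-related
# collisions and total collisions in two hypothetical years.
ped_hits <- c(early = 100, late = 150)
totals   <- c(early = 1000, late = 1000)

# Two-sample test of equal proportions; the null hypothesis is that the
# pedestrian share is the same in both years.
result <- prop.test(x = ped_hits, n = totals)

# A small p-value would count as evidence against the null hypothesis.
reject_null <- result$p.value < 0.05
```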
The proportions appear similar between non-DUI and DUI, in that the more serious the Severity, the lower the collision count, whether DUI is true or not.
What if we look at proportions instead of counts: will the proportions still appear the same, as suggested in the plot above? While we’re at it, let’s compare proportions for the INATTENTION variable.
The proportions of Severity for DUI are similar but not the same. Here we see there are more serious severities in DUI collisions.
Interestingly, the rates of Severity between DUI and INATTENTION indicators are very similar, although Severity tends to be worse in DUI collisions.
Later, let’s see if other factors come into play with the seriousness of DUI. For example, perhaps more DUI collisions occur at night, when people tend to imbibe.
Out of curiosity, how often are DUI and INATTENTION both marked as factors on the same Collision report?
In most reports with any DUI value (‘Y’ or ‘N’), INATTENTION is not marked. In all reports with no DUI value, INATTENTION also has no value.
This plot was just for curiosity; I will not pursue INATTENTION analysis further.
Light Conditions has the most significant differences between non-DUI and DUI. There are significantly more collisions in ‘Dark - Street Lights On’ conditions when DUI is true. Whether or not this is simply a factor of time (night time being when the streetlights are on and, perhaps, when more people are DUI), and less to do with the fact that the street lights are on, is a question requiring further study.
Although Weather Condition had a fairly strong correlation with DUI (about -0.57), there does not appear to be a significant difference in Weather between DUI and non-DUI collisions. It is worth noting, though, that very few DUI collision reports left WEATHER as ‘Unknown’. This may be another indication that SPD reports are more completely filled out when DUI is suspected.
Collision Type of ‘Other’ is more than double when DUI is True, but Sideswipes are significantly less, as are ‘Hits Pedestrian’.
It would be interesting to break down ‘Other’ Collision Types for more details.
Is it possible that glare from Street Lights could be a factor in DUI collisions?
(digging into this more is way outside the scope of this project)
Higher counts by LIGHTCOND at certain parts of the day hold no surprises: collisions in afternoons will naturally tend to be during Daylight.
It is interesting that most collisions occur during the afternoon. Is road volume a contributing factor?
Let’s look at the same data, in a line graph format. It may be easier to see what’s going on. Let’s also look at percentages instead of counts.
Again, no surprises here … LIGHTCOND and Hour are paired.
2015, our second peak year, has a more distinctive peak during the lunch hours, as well as an overall increase during the day. Interestingly, this same year saw a smaller midnight collision count than the other two years.
2005, our highest year, had a small drop in the morning commute hours that the other two years did not. Plus, there is a plateau of collision counts around the lunch hour that the other years did not have, and the spike in the evening commute is less pronounced.
Comparing our two peak incident count years, 2005 & 2015, with our low year, 2010:
2015 had more incidents during lunch hours, as well as an overall increase during the day, but fewer midnight collisions. Were more people commuting to work and venturing out less at night?
2005 had a small drop in the morning commute hours that the other two years did not see. Plus, there is a plateau of collision counts around the lunch hour that the other years did not have, and the spike in the evening commute is less pronounced. Were fewer people commuting that year?
The total number of DUI reports has changed over the years, but the rate of change is not dramatic.
In the span of the average day, there is a definite trend in that DUIs generally occur in the evening and night hours.
DUIs are involved in fewer ‘Sideswipe’ and ‘Pedestrian’ collision types, and more ‘Other’, suggesting more digging into ‘Other’ is needed.
DUI incidents tend to have higher severities.
There is a difference between SPD’s and SDOT’s categorization of collisions, which is interesting, but the difference appears to be consistent. It was SPD’s COLLISIONTYPE that proved to have stronger correlations with other factors, but for the sake of simplicity in my analysis, I chose to continue using SDOT’s categories in my plots.
There appeared to be anomalies when it came to SPD’s reporting of Pedestrian related collisions, and SDOT’s recordings of them. It turns out that whatever is happening between reporting and recording is consistent. It appears that SDOT categorizes some Pedestrian hits as ‘Head On’ as opposed to ‘Hits Pedestrian’.
INATTENTION increased significantly starting around 2013.
Most collisions happen in LIGHTCOND = ‘Daylight’, and during the afternoons.
I’m surprised that LIGHTCOND did not have a higher score with Hour in the correlation tables, and had a weak relationship with DayPart.
It appears that JUNCTIONTYPE is a subset of ADDRTYPE.
Weather (surprisingly, the cor value is high, but there is little difference between DUI and non-DUI weather, other than fewer DUI records are marked with WEATHER=‘Unknown’)
Light Condition (not surprisingly, DUI collisions are mostly in Street Light conditions)
In general, DUI and VEHCOUNT each have the strongest relationships with other features.
One of the strongest relationships was between DUI and LIGHTCOND (0.58). The share of collisions in ‘Street Lights On’ conditions more than doubles when DUI is true. However, this may very well be a ‘correlation is not causation’ example, in that people driving under the influence may more frequently be driving at night, when street lights just happen to be on.
In this section, I dig deeper into exploring relationships between:
Although DUI rates change very little over the years, they do seem to drop when the total number of collisions rises.
When DUI’s happen, compared to when Injuries happen.
This plot suggests that, although the total number of collisions between midnight and 4am is at its lowest, most of these collisions can be attributed to DUIs.
Also, the number of serious injuries increases during commute hours (4-6pm), and the number of fatalities is highest around 6pm and around 9pm.
In collisions that involved fatalities, on average most were ‘Hits Pedestrian’ and ‘Other Collision’. There were a few outliers in each of ‘Sideswipe’ and ‘Other Collision’, with more than 2x the annual average fatalities in certain years. Not surprisingly, there were the fewest fatalities in ‘Struck Object’.
More analysis would be required to figure out what ‘Other Collision’ involves. Looking at SPD’s COLLISIONTYPE helps, but they even have a significant ‘Other’ category with a significant number of ‘Other’ fatalities. But we can see here that some annual fatalities involve ‘Cycles’ and ‘Head On’:
In the Bivariate section, we explored the SPD’s Collision relationship with VEHCOUNT. Now let’s also compare VEHCOUNT with SDOT’s Collision categories, to see what the differences are, if any.
Whereas SPD tends to leave the VEHCOUNT=0 reports uncategorized (COLLISIONTYPE=‘’), SDOT follows up by assigning its own categorization to these incidents. Also, in the VEHCOUNT=12 column (there may have been only 1 VEHCOUNT=12 record), SPD marked it as ‘Parked Car’ while SDOT followed up and categorized the incident(s) as ‘Hits Pedestrian’. There are a number of other inconsistencies that would require further analysis outside the scope of this project.
The DUI / Light Condition / Severity relationship is strong: DUI collisions tend to happen in ‘Dark - Street Lights On’ conditions, and they also tend to have higher severities.
Also, DUI / Time of Day / Severity relationships are strong, in that LIGHTCOND and Time are strongly paired.
It appears that if there is an incident between approximately 12am and 4am, it will most likely involve a DUI.
Looking at averages across the years, reports suggest that most fatalities occur with ‘Hits Pedestrian’ and ‘Other’.
The disparity between SPD and SDOT collision categorization is made clearer when looking at Fatality averages. It is unclear why SDOT categorizes some collisions as ‘Other’ when SPD categorizes them as ‘Head On’, and some of those may involve pedestrians.
I am surprised to see that (generally) when annual DUI rates are up, total collision incident counts are down.
I wanted to introduce the Seattle Collisions dataset with a simple visual illustrating collision incident trends over time (across the Years, and through the course of an average day.)
The By Year plot shows the presence of a repeated rise-and-fall trend, highlighting 2 peak years (2005 and 2015) and 1 valley year (2010).
The By Time of Day plot presents a simple trend of collision frequencies throughout any given day, with a trend of increasing incidents from morning until afternoon commute hours.
I chose to use a 2-column grid, as opposed to a 1-column/2-row grid, so the y-axis would be taller. I did this because the range of incident counts (y-axis) is fairly large (up to about 16,000), and I wanted the plot to represent the large change in counts from year to year. If I used a 1-column grid, the difference between years would be presented as subtle. With the more subtle chart, I felt the need to print the rate change on each bin to show the differences, which proved to be too distracting.
I debated whether to include a point-line in the By Time of Day plot, to show where points actually sat in relation to the smoother slope. I decided to keep the plot simple, as its purpose was to simply show the general trend of collision incidents throughout the day.
I chose a grey color scheme to keep the charts simple. I did not want colors to distract from the main point (the trends). However, I did vary a scale of grey on the 2-highest and 1-lowest incident count By Years to draw the audience’s attention to those extremes.
I chose this plot because it shows the disparity between how the SPD (Seattle Police Department) and SDOT (Seattle Department of Transportation) choose to categorize each incident. For example, the left two bins show that when SDOT categorizes an incident as ‘Head On’, the SPD had already categorized it as ‘Pedestrian’. The SPD and SDOT had the same classification for very few of the ‘Head On’ and ‘Pedestrian’ labels. Less than 50% of the time they agreed on ‘Rear Ended’ classifications.
It is possible that both systems of categorization are coordinated intentionally, and that the absence of matching values reflects not disagreement but that by using both we get a better picture of each incident. It may also suggest the dataset could be enhanced if each category were a boolean factor, as opposed to two single-value factors.
I chose bright colors (from a colorblind-friendly color scheme) to emphasize the SPD vs SDOT disparities. I hope the audience’s first impression is to see the presence of a complication, and that it is interesting enough for them to spend a few moments comparing the SPD vs SDOT incident classifications.
I chose this plot to illustrate the increased rate of injury and death in DUI incidents versus non-DUI incidents. DUI incidents have a 10 times greater chance of resulting in Fatalities, and more than 3 times greater chance of Serious Injuries.
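A "times greater chance" figure like the ones above is a ratio of two rates. The sketch below shows the arithmetic on made-up counts, chosen only so the ratio comes out to 10; they are not the report's figures.

```r
# Toy counts (made up, NOT the report's figures) of incidents by DUI status.
incidents <- c(non_dui = 10000, dui = 500)
fatal     <- c(non_dui = 10,    dui = 5)

# Fatality rate within each group.
fatal_rate <- fatal / incidents

# Rate ratio: how many times higher the DUI rate is than the non-DUI rate.
rate_ratio <- fatal_rate[["dui"]] / fatal_rate[["non_dui"]]
```

Note that the DUI group has fewer fatalities in absolute terms here (5 versus 10) while its rate is ten times higher, which is the same contrast the plot is meant to show.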
I debated whether or not to use points in this plot, since the number of DUI incidents is far smaller than non-DUI, therefore resulting in fewer points on the plot (even though the top three rates are higher). I decided to include the points because I wanted to demonstrate that, even though there are fewer DUI incidents, their rates of fatalities and injuries are much higher.
I also removed the color legend because it would have been redundant with the y-axis labels.
I wanted to challenge myself with this project by finding my own dataset to use.
I initially thought the Seattle Collisions dataset was clean. While exploring the data, I progressively realized that was not quite true. I discovered the hard way that cleaning does take significant time (as warned in the project instructions!). Not just the time for writing code, but the time it takes to dig deeper into the data in order to make ethical decisions on how best to clean while avoiding the unintentional misrepresentation of data.
Also, this dataset has very few quantitative factors, making it challenging to create a variety of graph types. That being said, I learned more about plotting in R than I might have otherwise. I learned quite a bit by experimenting with color, scale, labels, and which plot types to use with which data type.
I enjoyed the ‘running commentary, stream of thought’ nature of this data exploration project. By not strategizing too much ahead of time exactly what I wanted to conclude at the end, I made a few surprising discoveries.
I was surprised that annual collision counts did not steadily rise over the years, but had a distinct rise/fall pattern. Additional analysis in future work should introduce Seattle population estimates, to see how they correlate with collision data and with patterns of incident factors over time.
I was not surprised there is an increase in the rate of ‘Inattention’ collisions, given the rise in the use of mobile devices. This may also be an interesting study for future work: what would ‘Inattention’ patterns reveal with more study of how (and how often) mobile devices are used by car occupants as well as pedestrians and cyclists?
Looking into the SPD vs SDOT categorizations made it more real to me how important it is to know how data is collected and stored before it even gets to the analyst; otherwise it’s difficult to make any assertions about the data and what it represents. For example, are the SPD and SDOT categorizations of each incident a coordinated effort to describe an incident, or is data collection inconsistent or incomplete?